Add Zenflow code for Stage 1 & 2 #7391
Conversation
Hi @tohtana, thank you for the thoughtful review and suggestions! I tried my best to avoid adding ZenFlow logic directly into the engine and the ZeRO optimizer. But for some shared functions like `average_tensor`, fully separating it would mean rewriting a large function with mostly duplicated code, which might make future maintenance harder when the upstream code changes. I'm happy to improve this further if that is considered better practice; I'm just not entirely sure whether full separation is the right trade-off here.
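For illustration, here is a minimal sketch of one possible decoupling (hypothetical class and hook names, not the actual DeepSpeed code): the shared reduction path stays intact and ZenFlow attaches through an optional callback instead of a duplicated `average_tensor`.

```python
# Hypothetical sketch only: decouple ZenFlow from the shared reduction path
# with an optional hook instead of duplicating average_tensor. Names are
# illustrative and do not match the actual DeepSpeed classes.
class ZeroOptimizerSketch:

    def __init__(self, zenflow_hook=None):
        # zenflow_hook(bucket) is injected only when ZenFlow is enabled.
        self.zenflow_hook = zenflow_hook

    def average_tensor(self, bucket):
        self._reduce_bucket(bucket)       # shared ZeRO reduction logic, unchanged
        if self.zenflow_hook is not None:
            self.zenflow_hook(bucket)     # ZenFlow-specific post-processing

    def _reduce_bucket(self, bucket):
        pass  # placeholder for the existing all-reduce / reduce-scatter code
```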
Hi @Antlera,
- Add ZenFlowCPUAdam and ZenFlowSelectiveAdamW for selective updates - Implement ZenFlowZeroOptimizer and its parallel variant - Support gradient offloading and communication overlap - Implement (un)flatten ops for column-major layout Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
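The column-major (un)flatten mentioned above could look roughly like the sketch below; these are illustrative helpers written against plain PyTorch, not the actual ZenFlow ops.

```python
# Hypothetical sketch of column-major (un)flatten: flattening a 2D parameter
# column by column so per-column slices stay contiguous, which is what
# importance-based column selection needs. Not the actual kernels.
import torch

def flatten_column_major(t: torch.Tensor) -> torch.Tensor:
    # Transpose first so columns become contiguous rows, then flatten.
    return t.t().contiguous().view(-1)

def unflatten_column_major(flat: torch.Tensor, shape) -> torch.Tensor:
    rows, cols = shape
    return flat.view(cols, rows).t().contiguous()

x = torch.arange(6).reshape(2, 3)
assert torch.equal(unflatten_column_major(flatten_column_major(x), x.shape), x)
```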
- Define ZenFlowConfig with support for selective update parameters - Add validation for ZenFlow-related config fields Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
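For reference, a hedged example of what a ZenFlow-enabled config might look like; the key names under `zenflow` are guesses based on the selective-update parameters described in this PR and may not match the merged `ZenFlowConfig` schema.

```python
# Illustrative DeepSpeed config enabling ZenFlow under ZeRO Stage 2 with
# optimizer offload. Key names under "zenflow" are assumptions based on the
# commit descriptions, not a verified schema.
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
        "zenflow": {
            "topk_ratio": 0.1,          # fraction of gradients treated as important
            "select_strategy": "auto",  # or "step" / "epoch"
            "select_interval": "auto",
            "update_interval": "auto",
            "full_warm_up_rounds": 0,
            "overlap_step": True,
        },
    },
}
```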
- Implement ZenFlow configuration and optimizer support in DeepSpeedEngine - Introduce methods for configuring ZenFlow parameters and handling selective updates - Enhance optimizer selection logic to accommodate ZenFlow optimizers - Update step function to manage ZenFlow-specific behaviors during training Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
- Introduce tests to validate the behavior of DeepSpeedZeroConfig with various configurations for ZenFlowConfig, including stage enumeration and offload optimizer settings. - Ensure proper coercion of dictionary inputs into ZenFlowConfig and validate error handling for incorrect types. - Test combined usage of offload_optimizer and zenflow configurations under stage 2. Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
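A rough sketch of the dict-coercion and type-validation checks described above (the `topk_ratio` field name is an assumption; the exact exception type is simplified):

```python
# Sketch of the dict-coercion and type-validation tests described above.
# The "topk_ratio" field name is an assumption; DeepSpeedZeroConfig is the
# pydantic model that owns the zenflow sub-config.
import pytest
from deepspeed.runtime.zero.config import DeepSpeedZeroConfig

def test_zenflow_dict_is_coerced():
    cfg = DeepSpeedZeroConfig(**{"stage": 2, "zenflow": {"topk_ratio": 0.2}})
    assert cfg.zenflow is not None
    assert cfg.zenflow.topk_ratio == 0.2

def test_zenflow_rejects_wrong_type():
    with pytest.raises(Exception):  # ValidationError in practice
        DeepSpeedZeroConfig(**{"stage": 2, "zenflow": 42})
```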
- Fix initialization logic for ZenFlowCPUAdam - Fix gradient update issues in ZenFlowSelectiveAdamW Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Signed-off-by: Yusen Wu <xrn4ub@virginia.edu> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
- Introduce tests for ZenFlowSelectiveAdamW covering both offload and non-offload modes. - Validate step and group_step behavior with selected index updates and temporary parameter storage. - Ensure correct handling of 1D and 2D parameters, as well as proper gradient/state cleanup after updates. - Verify state increment logic and compatibility with PyTorch's native AdamW for numerical correctness. Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Signed-off-by: Yusen Wu <xrn4ub@virginia.edu> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
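The selective-update behavior under test can be pictured with the small sketch below: Adam-style math applied only to selected rows of a 2D parameter, with weight decay and the real ZenFlowSelectiveAdamW state handling omitted.

```python
# Toy illustration of a selective update: Adam-style math applied only to the
# selected rows of a 2D parameter; unselected rows keep their values and state.
import torch

def selective_step(param, grad, exp_avg, exp_avg_sq, selected_rows,
                   lr=1e-3, betas=(0.9, 0.999), eps=1e-8, step=1):
    b1, b2 = betas
    g = grad[selected_rows]
    exp_avg[selected_rows] = b1 * exp_avg[selected_rows] + (1 - b1) * g
    exp_avg_sq[selected_rows] = b2 * exp_avg_sq[selected_rows] + (1 - b2) * g * g
    m_hat = exp_avg[selected_rows] / (1 - b1 ** step)
    v_hat = exp_avg_sq[selected_rows] / (1 - b2 ** step)
    param[selected_rows] -= lr * m_hat / (v_hat.sqrt() + eps)

p = torch.zeros(4, 8)
selective_step(p, torch.ones(4, 8), torch.zeros(4, 8), torch.zeros(4, 8),
               selected_rows=torch.tensor([0, 2]))
assert p[1].abs().sum() == 0 and p[0].abs().sum() > 0  # only selected rows moved
```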
- Introduce a new tutorial for ZenFlow, detailing its configuration and usage in DeepSpeed. Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
- Updated methods to accept communication_data_type as a parameter for better handling of IPG buckets. - Removed debug print statements to clean up the code. Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
- Move `_configure_zenflow` logic to a standalone `configure_zenflow()` function in `zenflow_utils.py` - Refactor ZenFlow placement to decouple it from ZeRO internals Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
- Simplify the `_configure_zenflow` method by assigning it a lambda function that calls `configure_zenflow(self)`. - Update the optimizer's selective learning rate synchronization to directly reference `self.optimizer._sync_selective_optimizer_lr()`. Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
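Roughly, the delegation described in the two commits above looks like the following sketch; the import path and surrounding method are assumptions for illustration, while `configure_zenflow` and `_sync_selective_optimizer_lr` are the names taken from the commit messages.

```python
# Sketch of the refactor above: the engine keeps a thin alias while all
# ZenFlow wiring lives in zenflow_utils. The import path is assumed, not
# verified against the merged code.
from deepspeed.runtime.zenflow.zenflow_utils import configure_zenflow  # assumed path

class EngineSketch:

    def __init__(self):
        # Delegate ZenFlow setup to the standalone helper.
        self._configure_zenflow = lambda: configure_zenflow(self)

    def after_step(self):
        # Keep the selective optimizer's learning rate in sync after stepping.
        self.optimizer._sync_selective_optimizer_lr()
```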
- Fixed the invocation of `reduce_gradients` in ZenFlow + ZeRO Stage 1 - Corrected the reduction logic in `extra_large_grad_reduc` to handle gradient aggregation properly - Fixed a bug where ZenFlow could not initialize if the user did not provide a dataset Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
- Implemented single-GPU and distributed tests for ZenFlow with ZeRO Stage 1 and 2 - Covered various configurations of selective optimizer offloading, selection strategies (auto/step/epoch), update intervals, and warm-up rounds - Ensured ZenFlow can initialize and train under different parameter combinations Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
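For context, a minimal single-process sketch of the kind of initialization these tests exercise (the `zenflow` keys are illustrative assumptions, as noted earlier):

```python
# Minimal sketch of initializing a ZenFlow-enabled engine, along the lines of
# the tests described above. The "zenflow" keys are illustrative assumptions.
import deepspeed
import torch

ds_config = {
    "train_batch_size": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
        "zenflow": {"select_strategy": "auto", "full_warm_up_rounds": 0},
    },
}

model = torch.nn.Linear(16, 16)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
```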
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
@sfc-gh-truwase All copyright issues have been fixed.
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Guokai Ma <guokai.ma@gmail.com>
@delock do you have additional concerns or can we merge this? Thanks
Since the workflow installs the latest CPU build of torch by default, it pulled 2.8.0+cpu, which caused the version check in tests/conftest.py to fail and exit.
=================================== FAILURES ===================================
______ TestMultipleModels.test_zero_optimizer[True-False-False-2-False-3] ______
[gw3] linux -- Python 3.12.3 /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
:-1: running the test CRASHED with signal 0
------------------------------- captured stderr --------------------------------
/home/runner/work/DeepSpeed/DeepSpeed/tests/conftest.py:50: _pytest.outcomes.Exit: expected torch version 2.7 did not match found torch version 2.8.0+cpu
___________________ TestNoSyncCtxt.test_zero_stage[0-dtype2] ___________________
[gw0] linux -- Python 3.12.3 /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
:-1: running the test CRASHED with signal 0
------------------------------- captured stderr --------------------------------
/home/runner/work/DeepSpeed/DeepSpeed/tests/conftest.py:50: _pytest.outcomes.Exit: expected torch version 2.7 did not match found torch version 2.8.0+cpu
______ TestMultipleModels.test_zero_optimizer[True-False-False-2-False-2] ______
[gw1] linux -- Python 3.12.3 /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
:-1: running the test CRASHED with signal 0
------------------------------- captured stderr --------------------------------
/home/runner/work/DeepSpeed/DeepSpeed/tests/conftest.py:50: _pytest.outcomes.Exit: expected torch version 2.7 did not match found torch version 2.8.0+cpu
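The guard that produces this exit is, in essence, a check of the following shape in tests/conftest.py (simplified sketch; the actual fixture may differ):

```python
# Simplified sketch of the torch-version guard in tests/conftest.py that
# produces the exit above; the real check may compare versions differently.
import pytest
import torch

def enforce_expected_torch(expected="2.7"):
    found = torch.__version__
    if not found.startswith(expected):
        pytest.exit(f"expected torch version {expected} did not match found torch version {found}")
```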
@sfc-gh-truwase Merged to check the new unit tests.
@sfc-gh-truwase Not sure what is causing the new errors in the CI. It says:
Run modal run -m ci.torch_latest
╭─ Error ──────────────────────────────────────────────────────────────────────╮
│ Token missing. Could not authenticate client. If you have token credentials, │
│ see modal.com/docs/reference/modal.config for setup help. If you are a new │
│ user, register an account at modal.com, then run `modal token new`. │
╰──────────────────────────────────────────────────────────────────────────────╯
Error: Process completed with exit code 1.
This might be related to #7289. Possible cause: the CI failures on forked PRs are due to missing Modal authentication (repository secrets are not exposed to workflows triggered from forks).
Merged to check the new CI. Maybe re-running it will solve the problem. I assume this will also bring the branch up to date.
This PR adds a blog post and images for ZenFlow, introducing its design, benefits, and usage. The blog explains how ZenFlow improves GPU utilization by overlapping computation and communication during offloaded training. See also: deepspeedai#7391 – core ZenFlow implementation; [deepspeedai/DeepSpeedExamples#982](deepspeedai/DeepSpeedExamples#982) – benchmarking and fine-tuning example. --------- Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com> Signed-off-by: lym <letusgo126@126.com>
This PR adds ZenFlow, an importance-aware offloaded training framework for DeepSpeed ZeRO. ZenFlow enables multi-step overlap between computation and communication during offloaded training, improving GPU utilization and reducing stalls. Highlights: - New ZenFlow optimizers (ZenFlowCPUAdam, ZenFlowSelectiveAdamW) - ZenFlowZeroOptimizer for ZeRO Stage 1/2 integration - Configurable via ZenFlowConfig, integrated with DeepSpeedZeroConfig - Unit tests and documentation included Note: This PR focuses on Stage 1 and 2 integration. Stage 3 support will be introduced in a follow-up PR. --------- Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Signed-off-by: Yusen Wu <xrn4ub@virginia.edu> Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com> Co-authored-by: Guokai Ma <guokai.ma@gmail.com> Signed-off-by: lym <letusgo126@126.com>
This PR completes the ZenFlow integration for DeepSpeed ZeRO Stage 3. Highlights: - ZenFlowSelectiveAdamW_stage3: Optimizer with importance-aware selective parameter updates for ZeRO Stage 3. - ZenFlowZeroOptimizer_Stage3: Full Stage 3 optimizer integration with partitioned parameters and CPU offload. - Configurable via ZenFlowConfig, fully integrated with DeepSpeedZeroConfig for Stage 3. - Unit tests for Stage 3 cases ensuring correctness and compatibility. Note: Integration with ZeRO Stage 1&2 was introduced in #7391 --------- Signed-off-by: Yusen Wu <xrn4ub@virginia.edu> Co-authored-by: Ma, Guokai <guokai.ma@intel.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Co-authored-by: Tingfeng Lan <erc8gx@virginia.edu>